W8 Lab Assignment


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
import warnings
warnings.filterwarnings("ignore")

sns.set_style('white')

%matplotlib inline

Ratio and logarithm

Visualizing ratios on a linear scale can be very misleading.

Let's first create some ratios.


In [2]:
x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1])
ratio = x/y
print(ratio)


[  1.00000000e-03   1.00000000e-02   1.00000000e-01   1.00000000e+00
   1.00000000e+01   1.00000000e+02   1.00000000e+03]

Plot on the linear scale using the scatter() function.


In [3]:
plt.scatter( np.arange(len(ratio)), ratio, s=100 )
plt.plot( [0,len(ratio)], [1,1], color='k', linestyle='--', linewidth=.5 ) # plot the line ratio = 1


Out[3]:
[<matplotlib.lines.Line2D at 0x174b4e434a8>]

Plot on the log scale.


In [4]:
plt.scatter( np.arange(len(ratio)), ratio, s=100 )
plt.yscale('log')
plt.ylim( (0.0001,10000) ) # set the range of the y axis
plt.plot( [0,len(ratio)], [1,1], color='k', linestyle='--', linewidth=.5 )


Out[4]:
[<matplotlib.lines.Line2D at 0x174b71893c8>]

What do you see from the two plots? Why do we need to use log scale to visualize ratios?

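# TODO: provide your answers

Graph 1 (linear scale): the plot invites misinterpretation. All ratios below 1 are squeezed between 0 and 1 and look nearly identical, while the largest ratio dominates the axis.

Graph 2 (log scale): the structure of the data is clear. The points fall on a straight line, and each ratio sits as far from the line ratio = 1 as its reciprocal does.

We need log scales to visualize ratios for the following reasons:
1. A log scale displays a large range of values without compressing the small ones at the bottom of the graph.
2. Log scales are also useful when the data is a time series in which changes are multiplicative, such as growth rates.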
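
A small numeric check of the symmetry argument, as a sketch: on a log scale a ratio and its reciprocal are equally far from ratio = 1, whereas on a linear scale they are not. The arrays are the ones defined in In [2]; the names `linear_dist` and `log_dist` are illustrative only.

```python
import numpy as np

x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1])
ratio = x / y

# Distance from the reference line ratio = 1 on each scale.
linear_dist = np.abs(ratio - 1)        # linear axis
log_dist = np.abs(np.log10(ratio))     # log axis, where log10(1) = 0

# ratio[0] and ratio[-1] are reciprocals of each other (1/1000 and 1000).
print(linear_dist[0], linear_dist[-1])  # very different on the linear scale
print(log_dist[0], log_dist[-1])        # the same on the log scale
```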

Let's practice this using random numbers. Generate 10 random numbers between 0 and 1, calculate the ratios between consecutive numbers (the second number divided by the first, the third divided by the second, and so on), and plot the ratios on the linear and log scales.


In [5]:
# TODO: generate random numbers and calculate ratios between two consecutive numbers
x = np.random.rand(10)
print(x)
ratio = [ i/j for i,j in zip(x[1:],x[:-1]) ]
print(ratio)


[ 0.63761511  0.10469875  0.97056734  0.12477944  0.97654204  0.77924561
  0.05397429  0.18957804  0.09342526  0.85091571]
[0.16420367512372158, 9.2700952373352017, 0.12856340433337238, 7.8261453147280964, 0.79796421860553013, 0.069264805033620147, 3.5123763991211376, 0.49280636583987364, 9.1079828832313154]
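
The same consecutive ratios can also be computed without a Python loop. A minimal sketch, assuming a fixed seed (`np.random.seed(0)`, which is not in the original cell) so that the run is repeatable:

```python
import numpy as np

np.random.seed(0)            # assumption: fixed seed for a repeatable example
x = np.random.rand(10)

# Element-wise division of shifted slices: x[1]/x[0], x[2]/x[1], ...
ratio_vec = x[1:] / x[:-1]

# Same result as a list comprehension over zip(x[1:], x[:-1]).
ratio_loop = [i / j for i, j in zip(x[1:], x[:-1])]
print(np.allclose(ratio_vec, ratio_loop))
```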

In [6]:
# TODO: plot the ratios on the linear scale
plt.scatter( np.arange(len(ratio)), ratio, s=100 )
plt.plot( [0,len(ratio)], [1,1], color='k', linestyle='--', linewidth=.5 )


Out[6]:
[<matplotlib.lines.Line2D at 0x174b71fa6a0>]

In [7]:
# TODO: plot the ratios on the log scale
plt.scatter( np.arange(len(ratio)), ratio, s=100 )
plt.yscale('log')
plt.plot( [0,len(ratio)], [1,1], color='k', linestyle='--', linewidth=.5 )


Out[7]:
[<matplotlib.lines.Line2D at 0x174b875f048>]

Log-bin

Let's first see what the histogram looks like if we do not use the log scale.


In [8]:
# TODO: plot the histogram of movie votes
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
plt.hist(movie_df['Votes'])


Out[8]:
(array([  3.12271000e+05,   5.00000000e+02,   1.35000000e+02,
          5.40000000e+01,   2.50000000e+01,   1.40000000e+01,
          3.00000000e+00,   6.00000000e+00,   1.00000000e+00,
          2.00000000e+00]),
 array([  5.00000000e+00,   1.51197800e+05,   3.02390600e+05,
          4.53583400e+05,   6.04776200e+05,   7.55969000e+05,
          9.07161800e+05,   1.05835460e+06,   1.20954740e+06,
          1.36074020e+06,   1.51193300e+06]),
 <a list of 10 Patch objects>)

As we can see, almost all movies fall in the first bin, and the counts in the remaining bins are too small to see. How about plotting on the log scale?


In [9]:
# TODO: change the y scale to log
plt.hist(movie_df['Votes'])
plt.yscale('log')


Change the number of bins to 1000.


In [10]:
# TODO: set the bin number to 1000
plt.hist(movie_df['Votes'], bins=1000)
plt.yscale('log')


Now, let's try log-bin. Recall that when plotting histograms we can specify the edges of bins through the bins parameter. For example, we can set the bin edges to [0, 1, 2, ... , 10] as follows.


In [11]:
plt.hist( movie_df['Rating'], bins=range(0,11) )


Out[11]:
(array([     0.,   1002.,   4610.,  13237.,  31489.,  61615.,  91469.,
         77636.,  31693.,    260.]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 <a list of 10 Patch objects>)

Here, we can specify the bin edges in a similar way, but in log space instead of on the linear scale.

Hint: since $10^{\text{start}} = \text{min_votes}$, $\text{start} = \log_{10}(\text{min_votes})$


In [12]:
# TODO: specify the edges of bins using np.logspace
bins = np.logspace( np.log10(min(movie_df['Votes'])), np.log10(max(movie_df['Votes'])), 20)
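
As a quick sanity check on what np.logspace() returns, the sketch below uses the smallest and largest bin edges printed in Out[8] (5 and 1511933) as stand-ins for the real column. Consecutive edges should differ by a constant factor rather than by a constant step.

```python
import numpy as np

min_votes, max_votes = 5, 1511933   # taken from the bin edges printed in Out[8]

edges = np.logspace(np.log10(min_votes), np.log10(max_votes), 20)

# 20 edges define 19 bins; the first and last edges match the data range.
print(len(edges), edges[0], edges[-1])

# Equal width on a log scale means a constant ratio between edges,
# while the linear widths grow from bin to bin.
edge_ratios = edges[1:] / edges[:-1]
linear_widths = np.diff(edges)
print(edge_ratios)
print(linear_widths)
```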

Now we can plot a histogram with log-bins.


In [13]:
plt.hist(movie_df['Votes'], bins=bins)
plt.xscale('log')


Is this a correct plot? What's the problem? Can you draw a correct one?

In [14]:
# TODO: correct the plot
plt.hist(movie_df['Votes'], bins=bins, density=True)  # 'normed' was removed in newer matplotlib; 'density' replaces it
plt.xscale('log')
plt.yscale('log')
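
Why the raw counts in In [13] mislead: log-bins have very different widths, so a wide bin collects more points simply because it is wide. Dividing each count by its bin width (and by the total number of points) gives a density, which is what the normalization option does. The sketch below checks this with np.histogram() on synthetic, heavy-tailed data; the data and the seed are made up for illustration and are not the IMDb votes.

```python
import numpy as np

rng = np.random.RandomState(0)               # assumption: synthetic data, fixed seed
data = np.exp(rng.normal(3.0, 2.0, 10000))   # log-normal, spans several orders of magnitude

edges = np.logspace(np.log10(data.min()), np.log10(data.max()), 20)
counts, _ = np.histogram(data, bins=edges)
widths = np.diff(edges)

# Density by hand: count / (total binned count * bin width).
density_manual = counts / (counts.sum() * widths)

# Density as computed by NumPy.
density_np, _ = np.histogram(data, bins=edges, density=True)

print(np.allclose(density_manual, density_np))
total_area = (density_np * widths).sum()     # a density integrates to 1
print(total_area)
```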


KDE

Import the IMDb data.


In [15]:
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
movie_df.head()


Out[15]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
2 #7DaysLater 2013 7.1 14
3 #Bikerlive 2014 6.8 11
4 #ByMySide 2012 5.5 13

We can plot histogram and KDE using pandas:


In [16]:
movie_df['Rating'].hist(bins=10, density=True)
movie_df['Rating'].plot(kind='kde')


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x174b74cbcf8>

Or using seaborn:


In [17]:
sns.distplot(movie_df['Rating'], bins=10)


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x174b7584400>

Can you plot the histogram and KDE of the log of movie votes?


In [18]:
# TODO: implement this using pandas
logs = np.log(movie_df['Votes'])
logs.hist(bins=10, density=True)
logs.plot(kind='kde')
plt.xlim(0, 25)


Out[18]:
(0, 25)

In [19]:
# TODO: implement this using seaborn
sns.distplot(logs, bins=10)


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x174bae9b748>

We can get a random sample using pandas' sample() function. The kdeplot() function in seaborn provides many options (like kernel types) to do KDE.


In [20]:
f = plt.figure(figsize=(15,8))

sample_sizes = [10, 50, 100, 500, 1000, 10000]
for i, N in enumerate(sample_sizes, 1):
    plt.subplot(2,3,i)
    plt.title("Sample size: {}".format(N))
    for j in range(5):
        s = movie_df['Rating'].sample(N)
        sns.kdeplot(s, kernel='gau', legend=False)
    plt.xlim(0, 10)  # set the x range on each subplot
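
To make explicit what a Gaussian KDE computes, here is a minimal hand-rolled sketch in NumPy: the estimate at each grid point is the average of Gaussian bumps centred on the samples. The sample is synthetic, the function name `gaussian_kde_1d` is hypothetical, and the bandwidth uses Scott's rule of thumb as an assumption; seaborn's own bandwidth choice may differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.RandomState(0)               # assumption: synthetic "ratings"
samples = rng.normal(6.0, 1.5, 500)

# Scott's rule of thumb in one dimension: sigma * n^(-1/5).
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(-5, 17, 2001)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should be non-negative and integrate to about 1.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

With small samples the individual bumps dominate, which is why the curves for N = 10 vary so much from one draw to the next.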


Regression

Remember Anscombe's quartet? Let's plot the four datasets and do linear regression, which can be done with scipy's linregress() function.

TODO: display the fitted equations using the text() function.


In [21]:
X1 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
Y1 = [8.04, 6.95, 7.58,  8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

X2 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
Y2 = [9.14, 8.14, 8.74,  8.77, 9.26, 8.10, 6.13, 3.10, 9.13,  7.26, 4.74]

X3 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15,  6.42, 5.73]

X4 = [8.0,  8.0,  8.0,   8.0,  8.0,  8.0,  8.0,  19.0,  8.0,  8.0,  8.0]
Y4 = [6.58, 5.76, 7.71,  8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

data = [ (X1,Y1),(X2,Y2),(X3,Y3),(X4,Y4) ]

plt.figure(figsize=(10,8))

for i,p in enumerate(data, 1):
    X, Y = p[0], p[1]
    plt.subplot(2, 2, i)
    plt.scatter(X, Y, s=30, facecolor='#FF4500', edgecolor='#FF4500')
    slope, intercept, r_value, p_value, std_err = ss.linregress(X, Y)
    plt.plot([0, 20], [intercept, slope*20+intercept], color='#1E90FF') #plot the fitted line Y = slope * X + intercept
    
    # TODO: display the fitted equations using the text() function.
    plt.text(2, 11, r'$Y = {:1.2f} \cdot X + {:1.2f}$'.format(slope,intercept))

    plt.xlim(0,20)
    plt.xlabel('X'+str(i))
    plt.ylabel('Y'+str(i))
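
A numeric companion to the plots, as a sketch: fitting each of the four datasets with np.polyfit() (used here instead of linregress() only to keep the example free of SciPy) gives a slope close to 0.5 and an intercept close to 3 every time, even though the scatter plots look nothing alike.

```python
import numpy as np

X123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]  # shared by sets 1-3
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = [(X123, Y1), (X123, Y2), (X123, Y3), (X4, Y4)]

fits = []
for i, (X, Y) in enumerate(datasets, 1):
    # Degree-1 least-squares fit; coefficients come back highest power first.
    slope, intercept = np.polyfit(X, Y, 1)
    fits.append((slope, intercept))
    print("Dataset {}: Y = {:.2f} * X + {:.2f}".format(i, slope, intercept))
```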


Actually, the dataset is included in seaborn and we can load it.


In [22]:
df = sns.load_dataset("anscombe")
df.head()


Out[22]:
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33

All four datasets are in this single data frame and the 'dataset' indicator is one of the columns. This is a form often called tidy data, which is easy to manipulate and plot. In tidy data, each row is an observation and columns are the properties of the observation. Seaborn makes use of the tidy form.
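
To illustrate the difference between a wide layout and the tidy form, here is a minimal sketch using the first two points of datasets I and II. Plain Python containers are used only to keep the sketch dependency-free; the names `wide`, `tidy`, and `subset_ii` are illustrative.

```python
# Wide form: one block of x/y values per dataset (first two points of I and II).
wide = {
    "I":  {"x": [10.0, 8.0], "y": [8.04, 6.95]},
    "II": {"x": [10.0, 8.0], "y": [9.14, 8.14]},
}

# Tidy form: one row per observation, one column per property.
tidy = [
    {"dataset": name, "x": x, "y": y}
    for name, cols in wide.items()
    for x, y in zip(cols["x"], cols["y"])
]

for row in tidy:
    print(row)

# Selecting one dataset now only needs a filter on the 'dataset' column.
subset_ii = [row for row in tidy if row["dataset"] == "II"]
```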

We can show the linear regression results for each dataset. Here is an example:


In [23]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1})


Out[23]:
<seaborn.axisgrid.FacetGrid at 0x174b9d7beb8>

What do these parameters mean? See the seaborn documentation for lmplot().

# TODO: explain what the parameters (x, y, col, hue, etc.) mean?
# Change the values of these parameters and see the results.

Parameters
1. x: the name of the column in data to plot on the x-axis.
2. y: the name of the column in data to plot on the y-axis.
3. col: the column whose values split the data into facets, one panel per value.
4. hue: the column whose values are drawn in different colors.
5. palette: a seaborn color palette name, or a dict mapping hue levels to colors.

In [24]:
sns.lmplot(x="y", y="x", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 25, "alpha": 0.8})


Out[24]:
<seaborn.axisgrid.FacetGrid at 0x174b7484da0>

2-D scatter plot and KDE

Select movies released in the 1990s:


In [25]:
geq = movie_df['Year'] >= 1990
leq = movie_df['Year'] <= 1999
subset = movie_df[ geq & leq ]
subset.head()


Out[25]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
23 'N Sync TV 1998 7.5 11
33 't Zal je gebeuren... 1998 6.0 7
34 't Zonnetje in huis 1993 6.1 148
42 .COM 1999 3.8 5

We can draw a scatter plot of movie votes and ratings using the scatter() function.


In [26]:
plt.scatter(subset['Votes'], subset['Rating'])
plt.xlabel('Votes')
plt.ylabel('Rating')


Out[26]:
<matplotlib.text.Text at 0x174bb95d3c8>

There are too many overlapping data points. We can decrease the symbol size, make the symbols hollow, and make them transparent.


In [27]:
plt.scatter(subset['Votes'], subset['Rating'], s=20, alpha=0.6, facecolors='none', edgecolors='b')
plt.xlabel('Votes')
plt.ylabel('Rating')


Out[27]:
<matplotlib.text.Text at 0x174bb969f28>

The number of votes is broadly distributed, so set the x axis to log scale.


In [28]:
plt.scatter(subset['Votes'], subset['Rating'], s=10, alpha=0.6, facecolors='none', edgecolors='b')
plt.xscale('log')
plt.xlabel('Votes')
plt.ylabel('Rating')


Out[28]:
<matplotlib.text.Text at 0x174baefc0f0>

We can combine a scatter plot with 1D histograms of each variable using seaborn's jointplot() function.


In [29]:
sns.jointplot(x=np.log(subset['Votes']), y=subset['Rating'])


Out[29]:
<seaborn.axisgrid.JointGrid at 0x174bb052208>

Hexbin

There are too many data points. We need to bin them, which can be done by using jointplot() and setting the kind parameter.


In [30]:
# TODO: draw a joint plot with hexbins and two histograms for each marginal distribution
sns.jointplot(x=np.log(subset['Votes']), y=subset['Rating'], kind='hex')


Out[30]:
<seaborn.axisgrid.JointGrid at 0x174b9d22940>
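
What binning does to an overplotted scatter can be sketched with np.histogram2d(). Note the swap: this uses rectangular bins rather than the hexagons drawn by the joint plot, and the data are synthetic stand-ins (made-up log-votes and ratings), not the IMDb subset.

```python
import numpy as np

rng = np.random.RandomState(0)                    # assumption: synthetic data
log_votes = rng.normal(4.0, 1.5, 20000)
rating = np.clip(rng.normal(6.0, 1.2, 20000), 1, 10)

# Count the points falling in each cell of a 30 x 30 grid.
counts, x_edges, y_edges = np.histogram2d(log_votes, rating, bins=30)

print(counts.shape)          # one count per cell
print(int(counts.sum()))     # every point lands in exactly one cell
print(int(counts.max()))     # the densest cell, which a scatter hides by overplotting
```

Coloring each cell by its count replaces thousands of overlapping markers with a readable density map.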

KDE

We can also do 2D KDE using seaborn's kdeplot() function.


In [31]:
sns.kdeplot(x=np.log(subset['Votes']), y=subset['Rating'], cmap="Reds", fill=True)  # newer seaborn: 'fill' replaces 'shade'; the lowest level is left unfilled by default


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x174ba4d4358>

Or using jointplot() by setting the kind parameter.


In [32]:
# TODO: draw a joint plot with bivariate KDE as well as marginal distributions with KDE
sns.jointplot(x=np.log(subset['Votes']), y=subset['Rating'], kind='kde', fill=True)


Out[32]:
<seaborn.axisgrid.JointGrid at 0x174ba44a1d0>